This assignment uses the tree, randomForest, gbm, and xgboost packages and the winequality-data.csv dataset from Canvas.
Load the winequality-data.csv data from the course Canvas website. Note that the last column is the ID of each sample and should be removed from further analysis.
data <- read.csv("winequality-data.csv", header = TRUE)
Split the dataset into 50% training and 50% testing.
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
inTrain <- createDataPartition(data[,"quality"], p = 0.5)[[1]]
wine.train <- data[inTrain,-13]
wine.test <- data[-inTrain,-13]
Build a decision tree model to predict wine quality from the 11 other features using the tree() function in the tree package, and visualize the result.
library(tree)
tree.model <- tree(quality ~ ., data = wine.train)
plot(tree.model)
text(tree.model)
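Optionally (an addition, not required by the question), the tree package's cv.tree() and prune.tree() can be used to check whether a pruned tree generalizes better. A sketch, assuming the tree.model object fitted above:

```r
# Cross-validate over tree sizes to see how deviance changes with complexity
cv.result <- cv.tree(tree.model)
plot(cv.result$size, cv.result$dev, type = "b",
     xlab = "Tree size", ylab = "Deviance")

# Prune to the size with the lowest cross-validated deviance
best.size <- cv.result$size[which.min(cv.result$dev)]
pruned.model <- prune.tree(tree.model, best = best.size)
```

If the best size equals the full tree's size, pruning leaves the model unchanged.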
Use the single decision tree model to predict wine quality in the test set and calculate the residual sum of squares error.
# Use the single decision tree to predict test data
tree.test <- predict(tree.model, newdata = wine.test)
# RSS - residual sum of squares
RSS <- sum((tree.test - wine.test$quality)^2)
# MSE - mean squared error
MSE <- mean((tree.test - wine.test$quality)^2)
print(RSS)
## [1] 1126.1
print(MSE)
## [1] 0.5748341
Use the randomForest package for this question.
Implement a random forest classifier trained on the training data from question 2 and use it to predict the test data.
library(randomForest)
rf.model <- randomForest(quality ~ ., data = wine.train)
rf.test <- predict(rf.model, newdata = wine.test)
Calculate the RSS on the test set prediction and compare it to the result of the single decision tree from question 2.
RSS.rf <- sum((rf.test - wine.test$quality)^2)
# Test set RSS from random forest
print(RSS.rf)
## [1] 835.2821
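As an optional diagnostic (an addition, not part of the question), randomForest reports variable importance for the fitted forest; for a regression forest this is the total decrease in node purity (IncNodePurity) attributable to each feature:

```r
# Per-feature importance from the fitted forest, and a dotchart of the same
importance(rf.model)
varImpPlot(rf.model)
```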
There are two parameters to tune in the randomForest model — number of trees to use (ntree) and the number of variables to consider at each split (mtry). Find the ntree value at which adding more trees does not improve performance.
# Try different values of ntree
RSS.test <- c()
for (i in c(100, 500, 1000, 1500, 2000, 5000)) {
  rf.model <- randomForest(quality ~ ., data = wine.train, ntree = i)
  rf.test <- predict(rf.model, newdata = wine.test)
  RSS.rf <- sum((rf.test - wine.test$quality)^2)
  RSS.test <- c(RSS.test, RSS.rf)
}
plot(c(100, 500, 1000, 1500, 2000, 5000), RSS.test, type = "b",
     xlab = "ntree", ylab = "RSS")
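The loop above only varies ntree; the second parameter mentioned in the question, mtry, can be tuned the same way. A sketch, assuming ntree = 500 is already in the flat region of the curve above (the exact plateau depends on your run):

```r
# Vary mtry over all possible values (1 to 11 features) at fixed ntree
RSS.mtry <- c()
for (m in 1:11) {
  rf.model <- randomForest(quality ~ ., data = wine.train,
                           ntree = 500, mtry = m)
  rf.test <- predict(rf.model, newdata = wine.test)
  RSS.mtry <- c(RSS.mtry, sum((rf.test - wine.test$quality)^2))
}
plot(1:11, RSS.mtry, type = "b", xlab = "mtry", ylab = "RSS")
```

For regression, randomForest's default is mtry = floor(p/3), i.e. 3 here, so values near that are a sensible starting point.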
gbm fitting
Use the gbm package for this question. Implement a boosted tree model trained on the training data from question 2 and use it to predict the test data. What is the RSS on the test set prediction, and how does it compare to the random forest model? Again, try different numbers of trees.
library(gbm)
RSS.test.gbm <- c()
for (i in c(100, 500, 1000, 1500, 2000, 5000)) {
  gbm.model <- gbm(quality ~ ., data = wine.train,
                   distribution = "gaussian", n.trees = i)
  gbm.test <- predict(gbm.model, newdata = wine.test, n.trees = i)
  RSS.gbm <- sum((gbm.test - wine.test$quality)^2)
  RSS.test.gbm <- c(RSS.test.gbm, RSS.gbm)
}
# Should notice the RSS is higher for the gbm model compared to random forest
plot(c(100, 500, 1000, 1500, 2000, 5000), RSS.test.gbm, type = "b",
     xlab = "ntrees", ylab = "RSS")
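As an optional refinement (not required by the question), gbm can select the number of trees itself via built-in cross-validation; the shrinkage, interaction.depth, and cv.folds values below are illustrative assumptions, not tuned choices:

```r
# Fit with a small learning rate and deeper trees, using 5-fold CV
gbm.cv <- gbm(quality ~ ., data = wine.train, distribution = "gaussian",
              n.trees = 5000, shrinkage = 0.01, interaction.depth = 4,
              cv.folds = 5)

# gbm.perf() returns the CV-optimal number of trees (and plots the CV curve)
best.iter <- gbm.perf(gbm.cv, method = "cv")
gbm.pred <- predict(gbm.cv, newdata = wine.test, n.trees = best.iter)
sum((gbm.pred - wine.test$quality)^2)
```

Because boosting keeps reducing training error as trees are added, predicting at the CV-chosen iteration guards against the overfitting that a fixed large n.trees can cause.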
xgboost fitting
Use the xgboost package for this question. Implement a boosted tree model trained on the training data from question 2 and use it to predict the test data. Use max_depth = 2 and plot the RSS for nrounds = 300, 500, 1000, 1500 and eta = 0.01, 0.02, 0.03. Check that a small value of eta is required for larger values of nrounds.
library(xgboost)
RSS.test.xgb <- matrix(0, nrow = 4, ncol = 3)
j <- 1
for (eta in c(0.01, 0.02, 0.03)) {
  i <- 1
  for (nrounds in c(300, 500, 1000, 1500)) {
    xgb.model <- xgboost(data = as.matrix(wine.train[, -12]),
                         label = wine.train[, 12], nrounds = nrounds,
                         max_depth = 2, eta = eta, verbose = FALSE)
    xgb.test <- predict(xgb.model, newdata = as.matrix(wine.test[, -12]))
    RSS.xgb <- sum((xgb.test - wine.test$quality)^2)
    RSS.test.xgb[i, j] <- RSS.xgb
    i <- i + 1
  }
  j <- j + 1
}
plot(x = c(300, 500, 1000, 1500), y = RSS.test.xgb[, 1], col = "red", type = "b",
     ylim = c(950, 1230), xlab = "N trees", ylab = "RSS")
points(x = c(300, 500, 1000, 1500), y = RSS.test.xgb[, 2], col = "green", type = "b")
points(x = c(300, 500, 1000, 1500), y = RSS.test.xgb[, 3], col = "blue", type = "b")
legend("topright", legend = paste0("eta = ", c(0.01, 0.02, 0.03)),
       col = c("red", "green", "blue"), lty = "solid", pch = 21)
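An alternative to scanning the grid by hand is to let xgb.cv pick nrounds with early stopping; the nfold and early_stopping_rounds values below are illustrative assumptions:

```r
# Wrap the training features/label in a DMatrix (quality is column 12)
dtrain <- xgb.DMatrix(as.matrix(wine.train[, -12]), label = wine.train[, 12])

# 5-fold CV; training stops once the CV error has not improved for 50 rounds
cv.result <- xgb.cv(data = dtrain, nrounds = 1500, max_depth = 2,
                    eta = 0.01, nfold = 5, early_stopping_rounds = 50,
                    verbose = FALSE)
cv.result$best_iteration
```

This makes the eta/nrounds trade-off explicit: with a small eta the best iteration is pushed toward the upper end of the range, consistent with the grid plot above.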